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Abstract. Dimension reduction and variable selection are performed 
routinely in case-control studies, but the literature on the theoretical 
aspects of the resulting estimates is scarce. We bring our contribution 
to this literature by studying estimators obtained via £i penalized like- 
lihood optimization. We show that the optimizers of the £i penalized 
retrospective likelihood coincide with the optimizers of the £i penalized 
prospective likelihood. This extends the results of Prentice and Pyke 
(1979), obtained for non-regularized likelihoods. We establish both the 
sup-norm consistency of the odds ratio, after model selection, and the 
consistency of subset selection of our estimators. The novelty of our 
theoretical results consists in the study of these properties under the 
case-control sampling scheme. Our results hold for selection performed 
over a large collection of candidate variables, with cardinality allowed to 
depend and be greater than the sample size. We complement our theo- 
retical results with a novel approach of determining data driven tuning 
parameters, based on the bisection method. The resulting procedure of- 
fers significant computational savings when compared with grid search 
based methods. All our numerical experiments support strongly our 
theoretical findings. 



1. Introduction 

Case-control studies investigate the relationship between a random out- 
come Y, typically the disease status, and a number of candidate variables 
Xj, 1 ^ J ^ M, that are potentially associated with Y. An important 
instance is provided by cancer studies, where the -'^^j's may quantify ex- 
posures to certain substances, or may be a collection of genes or genetic 
markers. One of the problems of interest in such studies is the identifica- 
tion, on the basis of the observed data, of a smaller subset of the set of 
candidate variables, that can reliably suffice for subsequent analyses. This 
problem becomes more challenging when the collection of potential disease 
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factors M is very large. Our solution to this problem is variable selection 
via penalized likelihood optimization. We propose below a computationally 
efficient selection method and we investigate the theoretical properties of 
our estimates under the case-control data generating mechanism. We begin 
by giving the formal framework used throughout the paper and by making 
our objectives precise. 

We consider binary outcomes Y S {0, 1}, where y = labels the controls 
(non-disease), and Y = 1 labels the cases (disease). Let fo{x), fi{x) be, re- 
spectively, the conditional distributions of X = {Xi, . . . ,Xm) given 1" = 
and Y = 1. Under the case-control or retrospective sampling scheme we ob- 
serve two independent samples from each of these conditional distributions. 
Formally, we observe 

(1.1) . . . , i.i.d. with density /o(x), 

Xl . . . , X^_^ i.i.d. with density /i(x), 

where the superscripts and 1 of X are mnemonic of the fact that the 
samples correspond to y = and y = 1, respectively. For simplicity of 
notation we assume that uq = ni = n. We assume that the outcome Y is 
connected to X via the logistic link function 

(1-2) p(y = i|x = x) = /"P('^° + ^^"\ , 

where /? G M^^ <5o G M. 

In this article we will further assume that /3 = /?*, where /3* € has 
non-zero components corresponding to an index set /* C {1, . . . , M} with 
cardinality |/*| = k* , possibly much smaller than M. The variable selection 
problem can be therefore rephrased in this context as the problem of esti- 
mating the unknown set I* from data generated as in (jl.ip . We note that 
this problem is not equivalent with the problem of estimating /* from i.i.d. 
pairs (Xi,Yi), with Yi generated from (jl.2p . as no random samples from the 
distribution of Y are available under the sampling scheme (jl.ip . However, 
the likelihoods corresponding to the two sampling schemes are intimately 
related, with results detailing their connections dating back to the 1970s. 
This link is essential for our procedure and we illustrate it below. Following 
Prentice and Pyke (1979), Section 3, an application of the Bayes' theorem 
combined with rearranging terms gives the following re-parametrization of 
/o and /i, respectively: 

(1.3) /o(x) = 2x— 1 xq{x)=:2po{x)q{x), 

1 -I- exp(o -I- p'x) 

„ / N exp((5 -|- l3'x) , , . ^ / N 

where (5 is a new intercept parameter, different than 5o, f3 is given by (jl.2p . 
and q{x) is a positive function that integrates to one. The parameters 6, 
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P and q are constrained by the requirement that /o and /i are probability 
densities, that is 



Therefore, the Ukehhood function corresponding to data generated as in 
(ll.ip . to which we wih refer in the sequel as to the retrospective likelihood, 
is: 



(1.5) Lretro{S,P,q) =: nti/o(a;°)/i(xi) 

= {nr=iPo(x°)pi(xi)} X {4ntiQ(x°)g(xi)} 
= : Lpros{S,P) X L{q), 



with parameters S,P and q related via the constraint (II. 4p . Notice that 
Lpros is, up to the intercept, exactly the standard logistic regression like- 
lihood, had we observed 2n i.i.d observations (Xi,Yi), with equal number 
of and 1 responses 1^ generated according to (|1.2|) : this quantity and the 
corresponding sampling scheme are typically referred to in this context as 
the prospective likelihood, and prospective sampling scheme, respectively. We 
will also use this terminology below. 

The earlier results on the estimation of /3 via (II. 5p did not address the 
model selection problem and were mostly concerned with the asymptotic 
properties of the estimates of /3. Anderson (1972), Prentice and Pyke (1979), 
Farewell (1979) are among the pioneers of this work and showed that: 

(1) The vector /3 G that maximizes the retrospective likelihood Lj-etroi^-, /?) 
under the constraint (|1.4p coincides with j3 G M^-' that maximizes the 
prospective likelihood Lprosi^, P)', 

(2) The asymptotic distribution of /?, derived under the sampling scheme 
(II. ip coincides with the asymptotic distribution of /3 derived under the 
prospective sampling scheme. 

A number of important works continued this program, and provided in 
depth analyses of various other aspects of the estimators of /? in retrospective 
studies. We refer to Gill, Vardi and Wellner (1988) and Carroll, Wang and 
Wang (1995), for more general sampling schemes, to Qin and Zhang (1997) 
for goodness of fit tests, to Breslow, Robins and Wellner (2000) for a study 
of the efficiency of the estimators, to Murphy and van der Vaart (2001), for 
partially observed covariates, to Osius (2009), for general semiparametric as- 
sociation models and to Chen, Chatterjee and Carroll (2009), for shrinkage 
methods tailored to inference in haplotype-based case-control studies and 
the asymptotic distribution of the resulting estimators. The variable selec- 
tion problem was not considered in any of these works. Although model 



(1.4) 




4 



BUNEA AND BARBU 



selection techniques are routinely used in case-control studies, and are typi- 
cally based on testing via the asymptotic distribution of /?, we are unaware 
of theoretical analyses of the performance of the resulting estimators under 
this sampling scheme. 

Our contribution to this literature is to provide answers to the model 
selection analogues of (1), and to formulate goals that replace (2) above by 
goals targeted to the dimension reduction and selection aspects. Specifically, 
we propose a model selection method based on a penalized likelihood ap- 
proach, with a sparsity inducing penalty. In this article we will focus on the 
li penalized likelihood with tuning parameter A. We will show the following: 

(I) For any penalty function pen{(3) that is independent of (5, the maximizer 
of Lretroi^, P) + pen{f3) under the constraint (jl.4p coincides with the maxi- 
mizer of the prospective likelihood Lprosi^, P) +pen{(3). 

(II) For pen{(3) = A^*;^^ |/3j|, we obtain estimators / of /* and dimension 
reduced estimators (3 of {3* by optimizing Lpros{^,P) +pen{f5). Then: 

(a) The behavior of P(/ = I*), analyzed under the sampling scheme (jl.ip 
is essentially the same as the behavior of P(/ = /*), evaluated under the 
prospective sampling schemes. 

(b) The estimator (3, analyzed under the sampling scheme (jl.ip . adapts 
to the unknown sparsity of (3*, which parallels the same property that can 
be established for /3, under the prospective sampling schemes. 

The result announced in (I) is an immediate extension of result (1) above 
established in the 1970s for the unpenalized likelihood. We present it in 
Section 2 below. The results in (II) necessitate an analysis that is completely 
different than the one needed for (2), and we discuss (a) in Section 4 and 
(b) in Section 3. The immediate implication of (b) that is relevant to case- 
control studies is the fact that the estimator (3, which is supported on a 
space of potentially much lower dimension than the original yields sup- 
norm consistent estimates of the odds ratio, where the odds ratio is defined 
as follows. The odds of having Y = 1 for an individual with characteristics 
X = X are 0{x) = P{Y = 1\X = x)/P{Y = 0|X = x), and the odds ratio 
is defined as 0{x)/0{xq), for some reference characteristic X = xq. Under 
model (|1.2p . the odds ratio becomes 

(1.6) R{x) =: exp(/3'(x - xq)). 

We establish the after model selection consistency of estimators of this ratio 
in Section 3 below. 
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In this article we will concentrate on the analysis of estimators obtained 
by optimizing ii penalized criteria. The literature on the theoretical as- 
pects of such estimates has seen an explosion in the past few years, together 
with the development of efficient algorithms for computing them. The re- 
sults pertaining to generalized linear models are most closely connected to 
our work, and all of them have been established for what we termed above 
the prospective sampling scheme, that is for data consisting of {Xi,Yi) i.i.d. 
pairs. For analyses conducted under this framework, we refer to van der 
Geer (2008) and Bunea (2008), for sparsity oracle inequalities for the Lasso, 
and to Bunea (2008), for correct subset selection; we refer to Ravikumar, 
Wainwright and Lafferty (2008) for related results on graphical models for 
binary variables. Motivated by the increasing usage of the Lasso type es- 
timators for the analysis of data arising from case-control studies, see e.g., 
Shi, Lee and Wahba (2007), Wu et al (2009) and the references therein, we 
complement the existing literature on this type of estimators by providing 
their theoretical analysis under the case-control data generating mechanism 
p.ip . To the best of our knowledge, this is the first such analysis proposed 
in the literature. 

The rest of the paper is organized as follow. Section 5 complements the 
theoretical results of Sections 2 - 4, by providing a fast algorithm for finding 
a data driven tuning parameter for the ii penalized optimization problem. 
Our procedure does not use a grid search. Instead, we use a generalization of 
the bisection method to compute tuning parameters that yield, respectively, 
estimators with exactly k = 0,1,..., M non-zero components. We then 
use a 10-fold cross-validated logistic loss function to which we add a BIC- 
type penalty to select one of these estimators. The corresponding procedure 
is easy to implement and offers important computational savings over a 
grid search based procedure, which can be 50 times slower, for the same 
degree of accuracy. Section 6 contains a detailed analysis of the proposed 
estimators, via simulations. It supports very strongly all our theoretical 
findings. Section 7 is a conclusion section, summarizing our findings. All 
the proofs are collected in the Appendix. 

2. Penalized logistic regression for case-control data 

In this section we investigate the type of penalty functions for which esti- 
mation of P via penalized retrospective log-likelihood optimization reduces to 
the estimation of (3 via penalized prospective likelihood optimization. Recall 
that we mentioned in (jl.Sp above that the retrospective likelihood can be 
written as the product of the prospective likelihood and a term depending 
on q: 

Lretro{S,(3,q) =: Lpros{S, P) X L{q) , 

where the parameters 5, (3 and q are constrained by (II. 4p . Let pen{l3) be a 
function that depends on /3 alone, and is independent of 6 and q. Define the 
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unconstrained maxima 

(2.1) (6, P) = axgmax{log Lpros{S,P) +pen{P)} ; g = argmaxlogL(g), 

S,I3 q 

where the second maximum is taken over all density functions q. The follow- 
ing result, proved in the Appendix, is an immediate extension of the result 
derived by Prentice and Pike (1979) for the unpenalized likelihood. 

Lemma 2.1. Let pen(P) be any function independent of 6 and q. Let 6,l3 
and q he given by 112.1]) . Then 

{(5,P),q) = axgmax{logLretro{S,P,q) +pen(/3)}, 
where the maximizer is computed over all 6, /?, q satisfying the constraint 



Lemma l2.ll shows that the penalized log- likelihood estimates of (5,/?), for 
a likelihood corresponding to data generated as in (jl.ip coincide with the 
penalized prospective log-likelihood estimates, which we rescale by 2n: 

1 

in| — ■ 

S,I3 



(2.2) {5, = mg mm{-— log Lpros{S,P) +pen{(3)} 



argmin(- ^log(l + e'^^''''') - ^Y.^^ + [5' X}) 

^'/^ i=i 1=1 



1 " 

+— ^log(l + e^+'3'^°) +pen(/3)). 



2n . , 

Lemma |2 . 1 1 holds for any function pen{P), as long as it is independent of 
6 and q. Since we are interested in dimension reduction, we will consider 
a sparsity inducing penalty. Throughout this paper our estimates will be 
obtained via (|2.2p with the ii penalty given below 

M 

(2.3) pen{l3) = X^lPjl 

for a tuning parameter A that will be made precise in the following section. 

3. Consistent estimation of the odds ratio 
after variable selection 



Recall that the true odds ratio (jl.6p is given in terms of /?*, which is sup- 
ported on a space of dimension k*, possibly much smaller than M. We show 
that the estimated odds ratio based on the selected variables correspond- 
ing to the non-zero elements of /? given by ()2.2p above, for penalty ()2.3p . 
provides a consistent estimate, in the supremum norm, of the odds ratio: 

(3.1) sup I exp P'{x — xq) — exp (3*'{x — xo)\ — > 0, 
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with probability converging to one. For simplicity of notation, we assume 
in what follows that xq = 0. For uniformity of notation, we also denote the 
intercept parameter given in (|1.3p by 6* . 

Our arguments are based on the following central fact, that may be of 
independent interest. Define the functions 

loie) =: ioie;x) = log(l + e^'^), 
4(0) =: 4(0; x) = log(l + e^'^) - 

and recall the notation 6 = {S,P). In order to write what follows in a 
compact way we further denote by P'' and the probability measures cor- 
responding to the densities /o and /i defined in (II. Sp . respectively. For a 
generic function g we write P'^^ and P^^f for integration with respect to P*^ 
and P^, respectively. With this notation we define the quantity 

(3.2) A(0,r) = V [io{9) - £o{e*)) + V (4(0) -4(r)) . 

Theorem 13.11 below, which is central to our paper, establishes that the dif- 
ference A is small. The proof, given in the Appendix, uses the control of 
appropriately scaled empirical processes corresponding to the two samples. 

If A ((5, /?), {6* ,13*)^ is small with high probability, relatively standard ar- 
guments can be used to show that 

sup\(3'x- P*'x + 6-6*\ 

X 

is also small, with high probability, and we establish this in Corollary 13.21 
below. 

To arrive at our desired result ()3.ip . we need to complement Corollary [32] 
with a study of the difference \6—5* \ . This is done in Theorem l3.3^ which also 
contains a stronger result: it shows that /? adapts to the unknown sparsity 
of P* , in that it is a consistent estimator of /3, with the rate of convergence 
of an estimate based only on /* variables. 

The combination of Theorem 13.11 Corollary 13.21 and Theorem 13.31 yields 
the desired sup-norm consistency of the odds ratio stated in Theorem 13.41 
All proofs are collected in the Appendix. All our theoretical results will 
be proved under the assumption that the design variables and the true pa- 
rameter components are bounded. We formalize this in Assumption 1 below. 

Assumption 1. 

(i) There exits a constant L > 0, independent of M and n, such that 
\^i \ ^ L, for all i and j, with probability 1. 

(ii) There exits a constant B > 0, independent of M and n, such that 
maxj\p*\ < B; \6*\ < B. 
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Let dn be any sequence converging to zero with n. Define the tuning sequence 



. , , , /21og2(M Vn 1 /21ogT- 

(3.3 r = logn 6LW ^ ^ + 7777 ^" 

^ ' ^ I V n 4(M Vn) V »^ 

li 5n = 1/n and M is polynomial in n, the tuning sequence r is of the order 
log "■Viog ^ ^ fpj^g results of this section will be relative to estimators obtained 
via the penalty (12. 3p . with tuning parameter A = 2r. For compactness of 
notation, we let 6 =: (?,^) G M^^+S 6** =: {6* , (3*) £ R^^+K Whenever we 
use this compact notation, we will also use the notation 6'u or 9*'u, for some 
vector u E M^-^^^^ that will be implicitly assumed to have the first component 
equal to 1. 

Theorem 3.1. Under Assumption 1, if k*r 0, then for any a > 

F{A{9,e*) > a) ^0, 

as n — > cxD. 

We give an immediate corollary of this theorem which establishes the sup- 
norm consistency of 9'x. It is interesting to note that both Theorem 13.11 
above and the corollary below hold under no assumptions on the dependence 
structure of the design variables. 

Corollary 3.2. Under Assumption 1, if k*r — > 0, then for any 7 > 
¥ (^supliP'x - P*'x) + (5 - 6*)\ — >0, 

as n ^ CO. 

The following theorem establishes rates of convergence for the estimates 
of 5* and (3* . It requires minimal conditions on the design variables. We 
formalize them below. Let y be a (M + 1) x (M + 1) matrix containing a 
{k* + 1) X [k* + 1) identity matrix, corresponding to I* and the intercept, 
and with zero elements otherwise. Let Xi, . . . , X„ denote X^, . . . , X^, and 
let Xn+i, ■ ■ ■ , X2n denote X\, . . . , X^. By convention, each X-i is viewed as 
a vector in M*^"''^ with the first component equal to 1, i.e. Xq^ = 1, for all 
i. We used this compact notation to avoid unnecessary superscripts. Let S 
be the (M + 1) x [M + 1) sample Hessian matrix with entries 

^ 2n 

— y^PQ{Xi)pl{X,)Xk^XJ,, 0<j,k<M. 

i=l 

Condition H. There exists < 6 < 1 such that P(i; - 6^ > 0) = 1. 

Remark. Condition H is a mild condition on S, as it requires that this ma- 
trix remain semi-positive definite after a slight modification of some of its 
diagonal elements, those corresponding to /*. This relaxes the conditions 
needed for the consistency of the estimators based on the non-regularized 
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log-likelihood, when one requires that the Hessian matrix be positive defi- 
nite, see e.g. Prentice and Pyke (1979). 

Let c = 6/bw, for b given by Condition H above and for some positive 
number w that is arbitrarily close to 1. 

Theorem 3.3. Under Assumption 1 and Condition H, if k*r 0, then 
(i) P(^|?-(5*| < crk*^ — > 1, 

{ii) Fi^J2\P,-P*\<crk*j -^1, 

as n ^ oo. 

The theorem shows that the rate of convergence of /? adapts to the un- 
known sparsity of /?*: X^jli \Pj — Pj \ has M terms, so we expect its size to 
be equal to the optimal rate of each term multiplied by M. Theorem 

13.31 shows that in fact /? behaves like an estimator obtained in dimension k* , 
had this dimension been known, as its rate of convergence in the ii norm 
is, up to constants and logarithmic terms, multiplied by k* , for our 

choice of r. 



Theorem 13.31 is the result announced in 11(b) of the introduction: the es- 
timators of (3* analyzed under the retrospective sampling scheme exhibit 
the same adaptation to sparsity as those analyzed under the prospective 
sampling scheme. For a full analysis of the £i penalized logistic regression 
estimates based on i.i.d. {Xi, Yi) pairs, with Yi generated as in ()1.2p we refer 
to Bunea (2008), for results obtained under conditions similar to Condition 
H on the Hessian matrix, and to van de Geer (2008), for results on gener- 
alized linear models obtained under conditions on the covariance matrix of 
the covariates. The difference between the results obtained under the two 
sampling schemes is essentially minor, and consists in a slight difference in 
the size of the tuning parameter r, which needs to be larger, by a logn 
factor, for the case control studies. This is a very small price to pay for not 
having full information on the joint distribution of X and Y. 

Corollarv l3.2l and Theorem l3.3l /^i) immediately imply, via a first order Taylor 
expansion, the desired result of this section. We summarize it in Theorem 
13.41 below which shows that for appropriate choices of the tuning parameters 
r, and if the size of the true model does not grow very fast relative to 1/r 
then, under minimal conditions on the covariates, we can estimate the odds 
ratio consistently. 

Theorem 3.4. Under Assumption 1 and Condition H, if k*r 0, then 
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P(sup \R{x) - R{x)\ < q) — >1, 

X 

for any a > 0, as n —> oo. 

4. Consistent variable selection 

In this section we investigate the consistency of the index set I corre- 
sponding to the non-zero components of the estimator /3 discussed above. 
We show that P(7* = I) — > 1 holds for the retrospective scheme, under 
conditions similar to those needed in the prospective sampling scheme. We 
state them below. 

Condition 1. There exists d > such that 

^1=1. 



max 

jei",kj^j 



1=1 / 



For po and pi defined in ()1.3p above, we assume that the following also holds. 



Condition 2. There exits d > such that 

2n 



max 

jei*,k^j 



■^Ylpo{Xi)pi{Xi)XijXik 

i=l 




1. 



Remark 1. The constants d in the two conditions above need not be the 
same; we used the same letter for clarity of notation. Remark 2. The two 
conditions above can be regarded as conditions guaranteeing the identifia- 
bility of the set of true variables /* . Condition 1 requires that there exists 
some degree of separation between the variables in /* and the rest, in that 
the correlations between variables in these respective sets are bounded, up 
to constants, by \/k* . If k* is small to moderate, the restriction is mild. 
Condition 2 reinforces Condition 1, by requiring that these variables remain 
separated even when separation is measured by the entries in the Hessian 
matrix, which can be regarded as weighted correlations. 

Let 5n be any sequence converging to zero with n. Define the tuning sequence 



N , . . /21og2 MVn 1 /21ogf^ 
(4.1 r = logn 6LW ^ + 7777 T + 4iV — 

and notice that the last term in this definition of r differs by a factor of 
\/\og M from the last term in r given in (13. 3p of the previous section. If 
6n = l/^^ and M is polynomial in n, the tuning sequence is again of the 
order '°g '^"^ - . The following assumption reflects the fact that we can only 
detect coefficients above the noise level, as quantified by r. 
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Assumption 2. miiijg/* > 4r. 

Theorem 4.1. Under Assumptions 1 and 2, if k*r and Conditions 1 
and 2 are met, then P(/* = I) — > 1, as n —> oo. 

Theorem 14.11 shows that /, analyzed under the retrospective samphng scheme, 
is a consistent estimator of /* , under essentially the same conditions needed 
for its analysis under the prospective sampling scheme, see Bunea (2008) 
and also Ravikumar, Wainwright and Lafferty (2008) for related results. 

5. Data based tuning sequences and the bisection method 

In Sections 3 and 4 above we showed that if the cardinality of the true 
model is not larger than ^/n, up to constants, the proposed method yields 
consistent estimation of the odds ratio and of the support of /?*, for tun- 
ing sequences r given in (|3.3p and (j4.ip . respectively. Since the constants 
involved in these expressions may be conservative, we complement our the- 
oretical results by offering in this section a fully data driven construction of 
the tuning parameters. Typical methods involve two main steps. In the first 
step one computes the regularization path of the solution to (j2.2p . as the 
tuning parameter r varies. Then, in a second step, one selects the appropri- 
ate value of r from a fine grid of values, by cross- validating the log-likelihood. 

We present here a method that follows in spirit this idea, but with two 
main modifications: we do not need to compute the regularization path, but 
rather a sketch of it and we choose r via a combined cross- validation/BIC 
-type procedure. We describe the resulting technique in what follows. 

We begin by noting that, unlike £i penalized least squares, the regulariza- 
tion path for the ii penalized logistic likelihood is not piecewise linear and 
it cannot be computed analytically, see e.g., Rosset and Zhu (2007), Koh et 
al. (2007). In this case, approximate regularization paths can be obtained 
from path following algorithms, see, e.g., Hastie et al. (2004), Park and 
Hastie (2006, 2007), Rosset (2005). These constructions rely on the follow- 
ing observation: the values of the tuning parameter r define a partition of 
the positive axis, such that for all r belonging to an interval of this parti- 
tion we obtain the same sparsity pattern. Thus, the full information on the 
sparsity pattern can be recovered from having a representative inside each 
such interval. 

We call the path corresponding to these representative values a "sketch" 
of the regularization path. The problem therefore reduces to finding the set 
of representatives. For this, one could perform a grid search and find the 
intervals where the sparsity pattern change. However, a grid search method 
may not include some of these intervals, as they can have arbitrary length. 
Thus, a grid search may skip some of the sparsity patterns that are in fact 
present in the regularization path. Figure [T] below offers an instance of this 
fact. 
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Figure 1. Approximate regularization path for normal iid 
data, 2n = 300, M = 15 and k* = 3. Left: the regularization 
path sketch of /3, obtained using GBM. Right: regularization 
path of P using a grid search with the same computational 
complexity as the GBM. 

It shows that a grid search with the same computational complexity as the 
generalized bisection method GBM, described in detail below, can fail to 
contain the true index set /* in the corresponding regularization path. In 
this example /* has three elements, which can be recovered in the left panel, 
and not in the right panel, as there is no value of r in the grid we considered 
for which we can have exactly three non-zero components. 

Our procedure replaces the grid search with a different method, that al- 
lows us to obtain, for each dimension 1 < k < M, a value of r for which 
the solution given by ()2.2|) has exactly k non-zero components. For each 
tuning parameter r, let the number of nonzero entries in Pr be denoted by 
n(r). For each dimension 1 < /c < M we let /ifc(r) = n(r) — k. Our method 
consists in finding r such that hk{r) = 0. To solve this equation we use 
the Bisection Method, e.g. Burden and Faires (2001), which is a well estab- 
lished computationally efficient method for finding a root z G R of a generic 
function h{z). The bisection method can be summarized as follows. Let a 
be the desired degree of accuracy of the algorithm. 

The Basic Bisection Method (BBM). 

Given a > 0, do: 

1. Choose zq,zi such that h{zQ)h{zi) < (i.e. h{zQ),h{zi) have differ- 
ent signs). 

2. If h{zo) = or h{zi) = stop. 

3. Take z = (zq + zi)/2. If \zi — zo\ < a then stop and return z. 

4. If h{z)h{zo) < 0, make zi = z. 

5. Else h{z)h{zi) < and make zq = z. 

6. Return to step 3. 
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We can apply the basic bisection method to solve hk{r) = 0, for each k, 
and obtain the desired values of the tuning sequence ri, . . . , vm- However, 
performing BBM for each dimension k separately is not computationally 
efficient, as there is a large amount of overlapped computation. In what fol- 
lows we propose an extension of the BBM that finds a sequence ro, 
such that n(rfc) = k, for all < A; < M. The extension uses a queue con- 
sisting of pairs {ri,rj) such that n(rj) < h{rj) — 1. 

The General Bisection Method for all k (GBM). 

Initialize all r, with —1. 

1. Choose ro very large, such that n(ro) = 0. Choose r„ = 0, hence 
n(r„) = n. 

2. Initialize a queue q with the pair (ro,r„). 

3. Pop the first pair (a, b) from the queue. 

4. Take r = {a + b)/2. Compute k = n{r). 

5. If Tfc = — 1 make = r. 

6. If \n{a) — k\>l and |a — r| > a, add (a, r) to the back of the queue. 

7. If |n(6) — k\>l and |6 — r| > a, add (r, b) to the back of the queue. 

8. If the queue is not empty, go to 3. 

The GBM described above offers a way of obtaining candidate values 
ro, . . . ,rM for the tuning parameter r. Observe that this method does not 
use anything specific to the £i -regularized logistic loss function. The GBM 
can be used in connection with any penalized loss function, for a sparsity 
inducing penalty. This makes the GBM method more general than the one 
proposed in Park and Hastie (2007), which is restricted to li- regularized 
generalized linear models. 

We performed a quantitative evaluation of the GBM on four problems 
of different difficulty levels. The data were simulated from a Normal dis- 
tribution, described in Section 6 below as NOR_IID. In Table [1] below we 
compared the GBM and a grid search in terms of their capability of con- 
structing approximate regularization paths containing the true /*. The per- 
centages reported in the table are percentages of time /* was in the path, 
over 250 simulations. Table [U indicates that if the set /* is in the regular- 
ization path, it will also be in the GBM path sketch, obtained at a low 
computational expense. 

To complete the selection procedure, we use the dimension stabilized p- 
fold cross-validation procedure summarized below. Let D denote the whole 
data set, and let D = Di U . . . U Dp be a partition of -D in p disjoint sub- 
sets, each subset containing the same percentage of cases and controls. In 
all the experiments presented in the next section we took p = 10. Let 
D-j = D \ Dj. We will denote by rj^ a candidate tuning parameter de- 
termined using the GBM on D-j. We denote by the set of indices 
corresponding to the non-zero coefficients of the estimator of /3 given by 
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Experiment 


Size 2n k* 


= \r\ 


GBM 


Grid 


GridxlO 


Grid X 50 


1 


100 


3 


70.0 


28.8 


64.8 


70.0 


2 


200 


3 


99.2 


79.2.7 


99.2 


99.2 


3 


300 


10 


88.4 


76.8 


87.2 


87.6 


4 


400 


10 


98.8 


93.2 


98.8 


98.8 


Table 1. 


Percentage 


of times the 


I* was 


present in 


differ- 



ent approximate regularization paths for the Normal dataset 
NOR_IID, M = 250. Grid has the same computational com- 
plexity as the GBM, while GridxlO and Grid x 50 are finer 
grids that are 10 respectively 50 times more expensive than 
the GBM. 



(j2.2p . for tuning parameter on D-j. We denote the unpenalized maxi- 
mum likelihood estimators corresponding only to the variables with indices 
in ll and computed on D^j by {5^ With log Lpros(<^i /?) defined in (j2.2p 
above, let L^, =: log Lpros{S^ , Pi) , computed on Dj. With this notation, the 
procedure becomes: 

Variable Selection Procedure. 

Given: a dataset D partitioned into p disjoint subsets: D = Di L) . . . L) Dp, 
each subset containing the same percentage of cases and controls. Let D-j = 
D \ Dj for ah j. 

1. For each 1 < k < M and each fold j of the partition, 1 < j < p: 

Use the GBM to find and such n(r^) = = /c on D^j. 

Compute Lj, =: log Lprosi^'' , f^l), as defined above. 

2. For each l<k<M: 

Compute Lfc =: i Yl%i ^i- 

3. Obtain , „ 

k = argminfLfc + 0.5A; ). 

k 2n 

4. With k from Step 3, use the BBM on the whole data set D to find 
the tuning sequence and then compute the final estimators using 
(j2.2p and this tuning sequence. 

Remark. In Section 4 above we showed that the tuning sequence required for 
correct variable selection needs to be slightly larger than the one needed for 
accurate estimation of the odds ratio. This suggests that cross-validation, 
which is routinely used for the latter purpose, is not the appropriate method 
for constructing a data driven tuning sequence for accurate variable iden- 
tification. This fact is becoming well recognized in practice and was also 
pointed out in Leng, C, Lin, Y., and Wahba, G. (2006), for linear mod- 
els. This motivates the modification we used in Step 3 of our procedure 
described above, where we added a BIC-type penalty to the cross-validated 
loss function. BIC-type methods have been used successfully for correct vari- 
able identification in a large variety of contexts, and we refer to Bunea and 
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McKeague (2005) and the references therein for related results in prospective 
models. The numerical experiments presented in Section 6 below indicate 
very strongly that the same is true for our variable selection procedure, ap- 
plied to the logistic likelihood and case-control type data. The theoretical 
analysis of this procedure is beyond the scope of this paper and will be 
addressed in future work. 



6.1. Simulation design. In this section we illustrate our proposed method- 
ology via simulations, for the three data generating mechanisms described 
below. We generate X = {Xi, . . . ,Xm) G ^^'^ according to one of the 
following three distributions: 

(1) SNP: Xi,. . . ,Xm are i.i.d. as U, where U is the discrete random 
variable having zero mean and variance 1: 



We used the label SNP for the first type of X we considered, as this type 
of distribution is encountered in the analysis of Single Nucleotide Polymor- 
phism type data; the i.i.d. assumption is not typically met for the basic 
SNP data sets, but is instead met for what is called tagging SNPs, which 
are collections of SNP representatives picked at large enough intervals from 
one another to mimic independence; we used the label SNP here for brevity. 

The parameters of the other two distributions are chosen as : = 1 and 
S = with p = 0.5. 

For each of the three distributions of X given above we generated two sam- 
ples, of size n each, from fo{x) and described in (jl.ip . respectively, for 
given P(y = 1) = vr and using the logistic link (jl.2p . 

The non-zero entries of f3 were set to 1, that is l3j = l,j £ I*. In our 
experiments, we took vr = P{Y = 1) = 0.01, i.e. we assumed a rare incidence 
of the disease. The value of 6o given by (jl.2p was found numerically in order 
to obtain the desired incidence vr. 

6.2. The behavior of (3'x. The quality of the estimators of the odds ratio 
R{x), for a given baseline xq, is dictated by the quality of (d'x as an estimator 
of P*'x. Notice that the sup-norm consistency of /3'x that follows from 
Corollarv l3.2l and Theorem l3.3l (i) above immediately implies its consistency 
in empirical norm or mean squared error MSE = ^ Yli=i(.f^' -^i ~ f3*'Xi)^ . 
We present below a numerical study of the mean squared error MSE for 
k* = 2> and k* = 10, for different values of n and M, with M possibly larger 
than n. In all the results of this section we report the median MSEs over 
200 simulations. 



6. Numerical Experiments 




1/2 








(2) NORJID: X - iV(0, cr^/jv^). 

(3) NOR.CORR: X ~ Af(0, S). 
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100 120 140 160 180 200 220 240 260 280 300 500 1000 1500 2000 2500 3000 3500 4000 



Figure 2. The median error vs. number 2n of observations, 
M = 2000. Left: k* = 3, right: k* = 10. 



Figure [2] shows the decrease of the MSE as the sample size increases, when 
M is kept fixed and set to M = 2000 in this example. Figure [3] shows that 
for given configurations {k* , 2n) the size of the MSE is essentially unaffected 
by an increase in the number of predictors M from 250 to 2000. 



- SNP 

- Normal iid 

■ Normal corrGlated 



400 600 800 1000 1200 1400 1600 1800 2000 




1000 1200 1400 1600 1800 2000 



Figure 3. The median error vs number M of predictors. 
Left: 2n = 300, k* = 3, right: 2n = 1000, k* = 10. 



Our findings are consistent with the behavior of the Lasso estimates in 
other generalized linear models and with our Theorem [33] of Section 3. The 
overall conclusion is that the size of the MSE is influenced by the model size 
k* and not by M, the total number of variables in the model. When the 
variables Xj are correlated, the MSE values are higher for the larger value 
of k* = 10 and very similar to the MSE values obtained for the uncorrelated 
variables when k* = 3. This supports the fact that the dependency struc- 
ture of the variables Xj^s per se does not have a dramatic impact on the 
MSE, as suggested by our theoretical results of Section 3, which hold under 
the very mild Condition H on the design variables. 



6.3. Variable selection accuracy. In this section we consider the same 
simulation scenario as above, and we shift focus to the quality of variable 
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selection. In Figure H] below we investigated the sensitivity of the estimated 
probability of correct variable selection, /* = /, or correct inclusion, /* C /, 
as we varied M = 250, 500, 1000, 2000. In the interest of space, we only 
report below the cases k* = 3, 2n = 300 (left column) and k* = 10, 2n = 
1000 (right column). 



— SNP 

— Normal iid 

■ - Normal corrGlated 



400 600 800 1000 1200 1400 1600 181 



— SNP 

— Normal iid 

■ - Normal corrGlated 



1000 1200 1400 1600 1800 2000 



— SNP 

— Normal lid 

— - Normal correlated 



800 1000 1200 1400 1600 1800 2000 



— SNP 

— Normal lid 

— - Normal correlated 



1000 1200 1400 1600 



Figure 4. The percentage of times / = /*(top row) and 
I* ^ /(bottom row) vs number M of predictors. Left column: 
k* = 3, 2n = 300, right column: k* = 10, 2n = 1000. 

We present, in Figure [5] below, graphs of the percentage of times I* = I 
and, respectively, /* C I, for M = 2000, as we varied n, for k* = 3 and 
k* = 10. For ease of reference, we summarized in Table [2] the sample sizes 
needed to identify the true model at least 90% of the time. 



SNP NORJID NOR_CORR 



k* =3 
k* ^ 10 



200 
800 



250 
1000 



300 
1000 



Table 2. Sample sizes needed for 



/*) > 0.90; M = 2000. 



The results of this section support strongly the theoretical results of Sec- 
tion 4. Our method can identify the correct model with high probability. 
The accuracy of selection is influenced by the true model size k* and the 
dependence structure of the variables Xj, as shown in Figure [5] and it is 
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100 120 140 160 180 200 220 240 260 280 300 500 1000 1500 2000 2500 3000 3500 4000 




100 120 140 160 180 200 220 240 260 280 300 500 1000 1500 2000 2500 3000 3500 4000 



Figure 5. The percentage of times / = /*(top row) and 
I* ^ /(bottom row) vs. number 2n of observations; M = 
2000. Left column: k* = 3, right column: k* = 10. 



much less influenced by an increase in the total number of the variables in 
the model, as shown in Figure HI 

7. Conclusions 
We summarize our overall contributions in this section. 

1. In this article we offered a theoretical analysis of the quality of model 
selection-type estimators in case-control studies. Our focus has been on 
optimizers of the ii regularized logistic likelihood. We showed that these 
estimators, analyzed under the case-control sampling scheme have model 
selection properties that are similar to those of the estimates analyzed un- 
der the prospective sampling scheme. In particular, we established the after 
model selection consistency of the odds ratio, and the consistency of subset 
selection. To the best of our knowledge, this is the first such theoretical 
analysis conducted for this sampling scheme. 

2. We introduced a computationally efficient variable selection and dimen- 
sion reduction procedure that uses a generalization of the Bisection Method, 
the GBM. We used the GBM to find simultaneously M tuning parame- 
ters, each yielding an estimator with exactly k non-zero entries, 1 < /c < M. 
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The final estimator is selected from the set of these M candidates as the 
minimizer of a p-fold cross-validated log-likelihood to which we added a 
5/C-type penalty. This technique is general and can be used in connec- 
tion with other loss functions and sparsity inducing penalty terms. The full 
theoretical investigation of this promising method is the subject of future 
research. All our simulation experiments indicate very good performance of 
this procedure in all the scenarios we considered. Moreover, our technique 
provides important computational savings over grid search based methods. 

Appendix 

Proof of Lemma l2.1L Let {5, /5) and qhe given by (12. ip . We show that the 
maximum value of log Lj.etros(<5, /?, (?) +pen{j3), over all {6,P,q) that satisfy 
the constraint (|1.4p . is bounded above and below by log Lretrosi^, P,q) + 
pen{f3). The bound from above is immediate, as a constraint maximum is 
always smaller or equal than the unconstrained maximum. To show the 
bound from below, we only need to verify that {5, (3, q) given by (|2.ip satisfy 
the constraint (jl.4p . 

Let G{5, (3) = log Lpros{S, j3)+pen{l3). If {6, (3) are given by (|2.ip . and pen{(3) 
is not a function of 6 then, in particular, we have : 

91ogG(?,/5) d\ogLpros(53) V^-/ V^-^ u 



1=1 i=l 



and also 



n 

n 

i=l i=l 



where pj denotes pj evaluated at {5, (3). Recall that the maximum likelihood 
estimator of q is 

1 / " n \ 

\i=l i=l ) 

where ba denotes the Dirac function. Then, condition (jl.4p is satisfied by 
(5, ^, since 

/ Pj{x)q{x)dx = — ^pj{x°) + ^Pj{x}) = -. 

\i=i i=i / 

This concludes the proof of this Lemma. ■ 



Proof of Theorem 13.11 We begin by introducing the notation used in the 
sequel. Let =: ^ Yll=i ^'^'^ —'■ ^ Sr=i ^x'^ denote the empirical 
measures associated with the two samples. For any function g of generic 
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Recall that we defined the functions 



argument x we will use the notation ¥ng{x) = ^ XliLi 5'(^«)i 3 — 0' 1- 



lo{0) =: eoiO;x) = log(l + e^'^), 
£i{e) =: h{0;x) = log(l + e^'^) - 

and recall the notation 9 = Then, by the definition of the estimator 

we have 

1 ^ 1 

(7.1) -(F%(e)+Fihi9))+2rY,m < - (P% W + Pi^i(^)) 

M 

for all 9 = {S,P). In particular, for ^ = 0, this shows that 

2^W<^ and |5|<^— , 
i=i 

where L is a common bound on Xfj,Xlj for all i and j. Therefore, the 
estimator is effectively computed on the parameter set 

(7.2) C=|/3GM^^5GM: E l/^^l ^ ^ ^ ^ 

and it is therefore enough to restrict our study to this set. 

Recah that we defined A{9,9*) in as 

A(^, 9*) = (io{9) - 4(r )) + ^Pi (^h{9) - h{9 



Note further that since (jT.ip holds for all 9 it holds in particular for for 

:fU\dj-P*\ 



r. Then, by adding A(9,9*) + rj^jli \'0j - P*\ to both sides of ([7T]) 



and re-arranging terms we obtain 
M 



{7.3)rf2Wj - P]\ + mo*) < Ik - P°) {(o{0*) - eo{0) 

+ i(pi-pi) [^£,{9*) - h{9)) 



M M M 



+ r^\f3,-P*\+2r^\(3*\-2r^\(3, 
j=i j=i j=i 
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Let e > be a quantity that will be made precise below. Then, for 6* G C 
given by ()7.2p above we define 

,0 _ . |(rg-p°)(i„(o-)-io(0))| 



L\ = sup 



log n 



where \v\i = Yl'j=i l^jl denotes the ii norm of a generic vector v S M'^, for 
some d. Define the events 

(7.4) So = {C < r}, £i = {Ci < r} . 
Then, on £q H £i display (j7.3p above yields 

M 

(7.5) rJ20j-Pj\+^ilO*) 

M M M 

< 2r^|^,-/3;|+2r5^|/?;|-2r^|^,-| 
j=i j=i j=i 

15- (5*1 

+ r— h re. 

logn 

We will argue below that: 

Fact 1. A(9,e*) > 0. 

Fact 2. F{£1^) and P(ff) ^ 0. 

Assume that Fact 1 and Fact 2 hold. Then, if Fact 1 holds, both terms in 
the left hand side of display (j7.5p are positive. Thus, in particular, display 
()7.5p yields on the event <?o H £i that 

M M M M 

A{9, r ) < 2r ^ + 2r ^ |/?*| + 2r ^ |/?*| - 2r ^ |^,-| 
j=i j=i j=i j=i 

\d-6*\ 

+ r— h re 

log n 

1 5 — 5* I 

< 4r "S^ I/?* I + r^ ^ + re. 

^-^ ■' log n 

Recall now that we have assumed that maxjg/» \(3*\ < B, \5*\ < B, for some 
positive constant B and that 5 £ C, and so \5\ < ^ ^ • Thus, on the event 
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SqCi £i, we have 

A(^, r ) < Ark*B + ^^^^ + r-^ + re. 

2 log n log n 

Then, for our choice of r, under the assumption that rk* and for the 
choice of e given below in (|7.1ip . the right hand side of the above display 
converges to zero with n. Therefore, for any a > we have the desired 
result: 

(7.6) P(A(^, r ) > a) < F{£^) + P(f^) — > 0, 

provided Fact 2 holds. We argue in what follows that the two facts hold. 

Proof of Fact 1. Recall the definition (|1.3p of fo{x), fi{x), q{x) and pi(x), 
and notice that we have /o(a;) + fi{x) = 2(7(3;). Let 9'x be an intermediate 
point between 9*'x and 9'x. Then, a second order Taylor expansion gives: 

(7.7) A(^, 9*) = j {\og{l + exp(^'x)) - log(l + exp(6'*'2;))) q{x)dx 

{9'x — 9*' x)pi{x)q{x)dx 

{9'x — 9*' x)pi{x)q{x)dx 



1 I exp(g^x) ^ 2 

2 J (l+exp(0'x))2^ ' ' 



exp(6''rc) 2. 



— / {9' X — 9*' x)pi{x)q{x)dx 

(9'x - 9*'xffo{x)dx 



4 7 (1 + exp(0'x))2 

1 [ exp(^'x) 2/ a*' \'i f ( \a 
4 J (1 + exp(t^'x))^ 

which shows that the left hand side is positive. 



Proof of Fact 2. Define the re-scaled empirical processes 

_ \{K-^'){h{9*)-h{9))\ 

and notice that 



(7.8) P(fo^) < P > 



log n 



and ¥{8l) < 



> 



logn 
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The proof of Fact 2 therefore rehes on the control of and Gj^. For this, 
we use the bounded difference inequality, see e.g. Theorem 2.2, page 8 in |9j. 
To apply it we need to evaluate by how much and G^ change if we change 
the i-th variable Xf, Xj , respectively, while keeping the others fixed. Recall 
thatpo = iEl ^=i^x^ is the empirical measure putting mass 1/n at each 

observation Xi. Let P^' be the empirical measure ^ [YllZii-Li + "^xf") 
corresponding to changing the pair X^ to X^' . Then 

(po - p°)(/o(r ) - i,{e)) (p°^ - F°)(/o(r ) - i,{e)) 

\I5- (3*\i + \5-5*\+e \(5- (3*\i + \5-5*\+e 

^ 1 W*-xf) - h{e;x^) - h{e*-xf) + h{e;xf) 

n \I3 - (3*\i + \6 - 8*\+ e 

.79. ^4L \(5-(5*\, + \5-5*\ ^4L 

^ ' ' - n \[3- (5*\i + \5-5*\+e " n ' 

where the inequality follows immediately by a first order Taylor expansion 
and the assumption that all X variables are bounded by L > 1. The calcu- 
lations involving G^ are identical, and yield the bound Therefore, we 
can apply the bounded difference inequality to obtain that 

(7.10) ¥\Gl-¥PQl>u) < exp-|^. 



re-state a version of it here, for ease of reference. 



We will use Lemma 3 in [27J to obtain bounds on E G^ and E G„. We 



Let Jn he an integer such that 2-^" > n and < e < Then, if the func- 
tions Iq and h defined above are Lipschitz in O'x and the components of x are 
hounded hy L, with prohahility one, then hoth E°G°, E^G^ are hounded hy 

^^^ 21og2(MVn) ^^^ 

2(Mvn)2 ' '"^^c'^c Ci , C2 are positive constants depending 
on the respective Lipschitz constants and L. 

Notice that Iq and h are Lipschitz in t = O'x, with respective constants 1 
and 2. Also, inspection of the chaining argument used in the proof of this 
lemma shows that we can take Jn = (M V n) and 

_ log 2 1 

^ - 2(MVn)+l ^ ~' 



for r given in (j3.3p . which we also recall here 



,„g„ I /2Iog2(MV,,) 1 4^ Ejgjj 

^ ' ^' n 4(M V n) V n 
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Then, making the constants precise in the lemma above and taking 



'21ogi 



n 



(7.12) u = AL] 

display (jT.lOp yields 

IP° f G° > < 6, (oi > < 6, 

V log n y \ logn J 

for any 5 > 0, in particular for any = 5„ — > 0. This display in combination 
with (j7.8p above gives the desired result. This completes the proof of this 
theorem. ■ 



Proof of Corollary 13.21 By definition 

A(e,e*) = J (log{l + exp(9'x))-log{l + exp{e*'x))^q{x)dx 
{9'x — 9*' x)pi{x)q{x)dx 



(9'x - 9* x) - , \^ . , q(x)dx, 

^ ^\l + exp{e'x) l + exp{9*'x)J^^ ^ 

where the last line follows via a first order Taylor expansion. Simple alge- 
braic manipulations show that if sup^ \ 9'x — 9*'x\ > 7, for any 7 > 0, then 
A{9,9*) > 0. Therefore, there exits such that A{9,9*) > a^. Invoking 
Theorem 13.11 above we therefore obtain 

(7.13) P (sup \9'x - 9*'x\ > 7 ) < P (a(9, 9*) > a^) — > 0, 



X / ^ 



which is the desired result. 



Proof of Theorem 13.31 Recall that in (j7.7p above we showed that 

4 J (1 + exp(t^'rE))"^ 

1 f exp{9'x) ^^f^ , 2. 



+ Z / ,,, {e'x-9*'xyMx)dx, 

47 (1 + exp(6''x))^ 
with 9'x being an intermediate point between 9*'x and 9'x. 



Let 7 > be arbitrarily close to zero, fixed. Let = {sup^ \9'x—9*'x\ < 7}. 
By (j7T3]) above we have P(A^) ^1. On we have 9*'x > 9'x - 7, for all 
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X, and therefore 

exp(^'x) ^„ . /l + exp(r'x) 



(l + exp(0'x))2 

exp(^'x - e*'x 



V 1 + exp(t^'x) / 



> 



exp(27) 

> exp( -87)^0(3:^)^1(2;) =: wpq{x)pi{x), 

for all x and with w arbitrarily close to one. Recall that f{x) = 7r/o(x) + 
(1 — 7r)/i(x), for P(y = 1) = vr. Then, on the set we have 

(7.14) Aile*) > J lpoix)pi{x)(9'x-e*'xf{fiix) + foix))dx, 

> — J pq{x)pi{x){6' X — 6*' x)"^ f {x)dx 

> wb(Y,iPj-Pj? + (^-n' 

where the last inequality follows from Condition H. Then, adding and sub- 
tracting r\6 — (5*1 to both sides in ()7.5p and using ()7.14p above we obtain 

M 

r\5-5*\+r^\j3j- (3*\+wh^0j- I3*f + wb{5-5*f 
j=i jai* 

M M M 

< 2r ^ 0, - P*\ + 2r ^ |/?*| - 2r ^ + r(l + |— " 5*1 + re 

j=i j=i j=i ^ 

< 4r V -/?*| +r(l + -^)|?-d*| +re 

^-^ logn 

ie/* 

< ^r^k* +wbY{f3j-p*f + wb{6-6*f, 
bw ^-^ 

je/* 

where we obtained the last line by using the Cauchy-Schwarz inequality 
followed by an inequality of the type 2uv < au^ + /a, for any o > 1. For 
clarity of exposition, we absorbed the term re, which is of order 1/2^^^", 
into the first term, which is much larger. Therefore 

M 

rQ-5*\+r^WT- m\ < —r^k*, 
i=i 

which implies 

.... 6 ... 
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on the set Aj, with P(A^) — > 1, which is the desired result. ■ 

Proof of Theorem mH Since P(/* = /)>!- P(/* ^ /) - P(I ^ I*), 
it is enough to control the two probabilities in the right hand side of this 
inequality separately. 

Control o/P(/* ^ /). 

Recall that we denoted the cardinality of /* by k* . Then, by the definition 
of the sets / and /* we have 



P(/* 2 I) < Fi^k^I for some k e I* 

< k* max F(j3k = and (31^0 

< Mmax P = and / 

as we always have k* < M. Recall that 

log Vos(^) = x;iog(i + e^'^'h - jz^e'x}) + x;iog(i + e^'^'h 

i=l i=l i=l 

and let Ln{0) = \ogLpros{(^)- By standard results in convex analysis, if 
= is a component of the solution 9 = {6, 13) then 



dLn{e) 



< 2r. 



Let k be arbitray, fixed. Define 



(7.15) Tnie) =: Tn 



M / ^ 2n N 

j=0 \ i=l / 



Recall that by convention Xio = 1, for all i, and for compactness of notation 
we will write Oq = 5 and 9q = 6. 
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Assuming without loss of generality that the data has been scaled such that 
^ X^r=i -^ik ~ ^ ^' therefore obtain, for every k ^ I* , that 



P /3fc = and / 



< 



< 



< 



Sn\ \Sn Tn\ 



dl3k 
dL„ 



\P*k\ 



dpk 

j^k 1=1 



<2r; 131 

<2r; /3^/0 



IS. 



71 n 





< 2r 1 







Therefore 



(7.16)P(r 2 /) < Mmax 



5L„(r) 



M 



> 



l/?fc|-2r 



MmaxP V (9,- - e*\ 



+ Mmaxl 

fee/* 



^ 2n 

1=1 

l/?^|-2r 



ik 



> 



= (/) + (//) + (///). 

In what follows we bound each of the above terms individually. 

Bound on (III). Let 9'Xi be a point between 9'Xi and 9*'Xi, for each i. A 
first order Taylor expansion yields 



Tn 



exp(g"-X,) 

^ ^ ^ ^ ^ _ _ exp(^'X, - r'x, 
2n 



i=o i=l 



(l + exp(0'Xi-0*'X,)) 



,Pi{XMXi){ej 



where the last line follows from the first by dividing and multiplying the 
summands by pi{Xi)pQ{Xi) defined in (|1.3p above. Let 7 be arbitrarily 
close to zero, fixed. Let be the set on which sup^, \9'x — 9*'x\ < 7. 
Corollarv 13.21 above guarantees that F{D^) 0. Since 7 is arbitrarily close 
to zero, the difference 
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with arbitrarily close to 0, for each i, on the set D.y. Therefore we have 

\Pl\-'2r^ 



< 



M 2n . 
j=0 i=l ^ 



pi{Xi)po{Xi 



> 



6 



+ 



M 2n 



2^^ E E DniXijXik(9j - e*^ 

j=0 1=1 



> 



6 



+ 



=: (a) + (6)+P(D!;). 
To bound (a) we first invoke Conditions 1 and 2 to obtain that 

M 



^2^2^^-o^^^ 1 7 



i=0 i=l 



< - V 



where C is a positive constant, independent of n and depending only on the 
constant d of Conditions 1 and 2. Combining this with the assumption that 
miufcg/* 1/3^1 > 4r we arrive at 

(M 
^\Oj-e*\ > ck* 
j=0 

for some positive constant c depending only on d and not on n. 

To bound (6) we recall that all X variables are bounded by L and that \Dni\ 
is bounded by Note that we can always choose = -pr, as we can always 
choose the corresponding 7. Therefore, the resulting bound is 

^ \ 
(6) <P ^1^^ -0*1 >cik*r , 

for some ci > depending on L and not on n. Collecting the bounds above 
we thus have 



M 



M 



(III) < MP ^ \9j -0*\> ck*r + MP ^ \.j , 
\j=o J \j=o 

+ MF{D^) 

< 3M(P(£:°) + P(fi)), 



> cik*r 



where the last line follows from the proofs of Theorem 13. 31 and Corollarv l3.2l 
for sets £^,£^ defined as in (I7.4p above, and with r replaced by the slightly 



DIMENSION REDUCTION IN CASE CONTROL STUDIES 



29 



larger value given in ()4.ip above. Then, as in the proof of Fact 2 of Theorem 
13.11 and for this value of r, we obtain 

OA 

(///) < MI— < 6(5 — >0, 
for any (5 = 5„ ^ 0, as n ^ oo. 

Bound on (II). We reason exactly as above, using now Assumption 2 and 
Condition 1 to obtain, for some constant c > 0, that 

(M 
^\Oj-0j\ > ck*r 
j=0 



(7.17) 

for any 6 = 6n 0^ as n oo. 



9A 

< M(P(£:°) + ¥{£^)) < M— <25 — > 0, 



Bound on (I). First notice that, for any k we have 



U=i 



^xi + ^xipi{xl) 



i=l 



1=1 



Let 



and notice that EZj^ = 0, for all i and k. Since \ Zik\ < 3L, we use Hoeffding's 
inequality to obtain 



> V \ < exp 



df3k 



and this is bounded hy 6/M for any v > 



9L2 



3L^log M/S 



/2n 



. Thus, by Assumption 



2 and our choice of r given in (j4.ip . we have that (/) < 6. 
Collecting now the bounds on (/), (//) and (///) we therefore obtain 

p(r 2 /) < 95 ^ 0, 

for any (5 = (5„ — > 0, as n ^ cxd, which completes the first part of the proof. 



Control ofF{I ^ I*). The control of this quantity will essentially involve 
the usage of the same probability bounds as above. We will however need a 
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different intermediate argument. Let E R'^*"^^, where we denote the first 
component of /i by 6. Define 

{ i=l i=l i=l J 

and define 

(7.18) Jl = aigmin H{fj,), 

where by convention we denote the first component of by 5. Let, by abuse 
of notation, Jl G M^^+-'^ be the vector that has the first component 6, the 
other components of Ji in positions corresponding to the index set I*, and 
is zero otherwise. Define the set 



Bi 



n 



< 2r 



By standard results in convex analysis it follows that, on the set Bi, Jl 
is a solution of (j2.2p . Recall that ^ is a solution of (j2.2p by construction. 
Then, using simple convex analysis arguments as in Proposition 4.2 in Bunea 
(2008), we obtain that any two solutions have non-zero elements in the same 
positions. Since, on the set Bi, 9k = ior k £ I*^ we conclude that I O I* 
on the set Bi. Hence 



(7.19) p(/ 2 r) 




> r 



^ 2n 

In ^ 



> r/2 



+ M max 

k0* 



Tn(M)| >r/2), 



where T„ and Sn have been defined in (j7.15p above. We notice that the 
display (17.19P is almost identical to display (j7.16p above, and so it can be 
bounded in a similar fashion. The only difference is in invoking versions of 
Theorem 13. 31 and Corollarv 13.11 corresponding to 9 replaced by Jl, which hold 
under the same assumptions and for the same tuning parameters. Conse- 
quently 

^ 2 /*) < 95 ^ 0, 



which concludes the proof of this theorem. ■. 
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